This post follows the content of Reference 1 below and walks through building a Delta Live Tables pipeline hands-on.
The example Python notebook can be downloaded here.
# Imports
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Ingest raw clickstream data
json_path = "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"
@dlt.table(
    comment="The raw wikipedia clickstream dataset, ingested from /databricks-datasets."
)
def clickstream_raw():
    return (
        spark.read.json(json_path)
    )
# Clean and prepare data
@dlt.table(
    comment="Wikipedia clickstream data cleaned and prepared for analysis."
)
@dlt.expect("valid_current_page_title", "current_page_title IS NOT NULL")
@dlt.expect_or_fail("valid_count", "click_count > 0")
def clickstream_prepared():
    return (
        dlt.read("clickstream_raw")
        .withColumn("click_count", expr("CAST(n AS INT)"))
        .withColumnRenamed("curr_title", "current_page_title")
        .withColumnRenamed("prev_title", "previous_page_title")
        .select("current_page_title", "click_count", "previous_page_title")
    )
# Top referring pages
@dlt.table(
    comment="A table containing the top pages linking to the Apache Spark page."
)
def top_spark_referrers():
    return (
        dlt.read("clickstream_prepared")
        .filter(expr("current_page_title == 'Apache_Spark'"))
        .withColumnRenamed("previous_page_title", "referrer")
        .sort(desc("click_count"))
        .select("referrer", "click_count")
        .limit(10)
    )
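A note on the data-quality expectations used above: @dlt.expect only records rows that violate the constraint and keeps them, while @dlt.expect_or_fail stops the update when a violation occurs. The DLT Python API also provides @dlt.expect_or_drop, which silently drops violating rows. As a minimal sketch (not part of the original notebook, assuming the same imports as above and a hypothetical table name):

# Hypothetical variant of clickstream_prepared: drop rows with a non-positive
# click count instead of failing the whole update.
@dlt.table(
    comment="Clickstream rows with non-positive counts dropped."
)
@dlt.expect_or_drop("valid_count", "click_count > 0")
def clickstream_prepared_dropped():
    return (
        dlt.read("clickstream_raw")
        .withColumn("click_count", expr("CAST(n AS INT)"))
        .withColumnRenamed("curr_title", "current_page_title")
        .select("current_page_title", "click_count")
    )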
To create a pipeline, define a notebook or source code file whose contents use the Delta Live Tables syntax to declare the dependencies. Note that each source file can be written in only one language, but a pipeline can mix languages by adding multiple source files to its libraries, as in the sketch below.
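For illustration, here is a minimal sketch of what the pipeline settings might look like, written as a Python dict mirroring the JSON you would send to the Databricks Pipelines REST API (POST /api/2.0/pipelines) or paste into the UI's JSON editor. The pipeline name and notebook paths are hypothetical; the two library entries show how a Python notebook and a SQL notebook can be mixed in one pipeline.

# Sketch of pipeline settings; names and paths below are hypothetical examples.
pipeline_settings = {
    "name": "wikipedia-clickstream-demo",
    "libraries": [
        {"notebook": {"path": "/Repos/demo/dlt_clickstream_python"}},  # Python source
        {"notebook": {"path": "/Repos/demo/dlt_clickstream_sql"}},     # SQL source
    ],
    "target": "wikipedia_demo",  # target schema for the published tables
    "continuous": False,         # triggered (non-continuous) pipeline
}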
Below are screenshots of some of the steps for reference.
Reference: